85 research outputs found
Duration modeling with expanded HMM applied to speech recognition
The occupancy of the HMM states is modeled by means of a Markov chain. A linear estimator is introduced to compute the probabilities of the Markov chain. The distribution function (DF) accurately represents the observed data. Representing the DF as a Markov chain allows the use of standard HMM recognizers. The increase in complexity is negligible in training and strongly limited during recognition. Experiments on acoustic-phonetic decoding show that the phone recognition rate increases from 60.6 to 61.1. Furthermore, on a database inquiry task, where phones are used as subword units, the correct word rate increases from 88.2 to 88.4. Peer Reviewed. Postprint (published version)
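As a rough illustration of how a Markov chain can stand in for a state-duration distribution, the sketch below computes the duration probabilities of one expanded state. It is not the paper's linear estimator: the function name `duration_pmf`, the sub-state layout, and the example probabilities are all assumptions made for illustration.

```python
import numpy as np

def duration_pmf(trans, exit_probs, max_d):
    """Duration distribution of an expanded HMM state (illustrative sketch).

    trans[i, j]   : probability of moving from sub-state i to sub-state j
                    while remaining inside the expanded state (assumed layout)
    exit_probs[i] : probability of leaving the expanded state from sub-state i
    Returns an array p where p[d - 1] = P(duration = d frames), d = 1..max_d,
    assuming the expanded state is always entered at sub-state 0.
    """
    trans = np.asarray(trans, dtype=float)
    exit_probs = np.asarray(exit_probs, dtype=float)
    # Each sub-state must either stay inside the chain or leave it.
    assert np.allclose(trans.sum(axis=1) + exit_probs, 1.0)

    alive = np.zeros(len(exit_probs))
    alive[0] = 1.0                      # occupancy of sub-states at frame 1
    pmf = []
    for _ in range(max_d):
        pmf.append(float(alive @ exit_probs))  # leave after the current frame
        alive = alive @ trans                  # otherwise advance one frame
    return np.array(pmf)

# A single sub-state with a self-loop gives the usual geometric duration.
single = duration_pmf([[0.5]], [0.5], max_d=3)

# Two chained sub-states force a minimum duration of two frames.
chained = duration_pmf([[0.5, 0.5], [0.0, 0.5]], [0.0, 0.5], max_d=3)
```

Because the expanded state is still an ordinary chain of states with transition probabilities, a standard HMM decoder can use it unchanged, which is the property the abstract highlights.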
ML weighting of parameters in a CDHMM-based word recognition system (Ponderación ML de parámetros en un sistema de reconocimiento de palabras basado en CDHMM)
Dynamic speech features are routinely used in current speech recognition systems in combination with short-term (static) spectral features. The aim of this paper is to propose a method to automatically estimate the optimum weighting of static and dynamic features in a speech recognition system. The recognition system considered in this paper is based on Continuous-Density Hidden Markov Modelling (CDHMM), which is widely used in speech recognition. Our approach consists of 1) adding two new parameters to each state of each model that weight the two kinds of speech features, and 2) estimating those parameters by means of Maximum Likelihood training. Experimental results in speaker-independent digit recognition show a substantial increase in recognition accuracy. Peer Reviewed. Postprint (published version)
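One common way to weight two feature streams in a state score is to scale their log-likelihoods separately, which the sketch below illustrates. The abstract does not give the exact form used in the paper, so the exponent-style weights, the diagonal-Gaussian emission, and every name here (`diag_gauss_loglik`, `weighted_state_score`, the `state` dictionary keys) are assumptions. The Maximum Likelihood estimation of the weights is not shown.

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def weighted_state_score(x_static, x_dynamic, state, w_static, w_dynamic):
    """Per-state score with separate weights for static and dynamic features.

    `state` is a hypothetical dict holding the Gaussian parameters of each
    stream; w_static and w_dynamic play the role of the two extra per-state
    parameters (assumed here to act as log-likelihood scale factors).
    """
    ll_static = diag_gauss_loglik(x_static, state["mean_s"], state["var_s"])
    ll_dynamic = diag_gauss_loglik(x_dynamic, state["mean_d"], state["var_d"])
    return w_static * ll_static + w_dynamic * ll_dynamic

# Toy state with 2 static and 2 dynamic coefficients (made-up values).
state = {
    "mean_s": [0.0, 1.0], "var_s": [1.0, 2.0],
    "mean_d": [0.5, -0.5], "var_d": [0.5, 0.5],
}
x_s, x_d = [0.1, 0.9], [0.4, -0.2]
score = weighted_state_score(x_s, x_d, state, w_static=1.0, w_dynamic=0.7)
```

With both weights set to 1, the score reduces to the ordinary log-likelihood of the concatenated feature vector, so the unweighted system is a special case of this form.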
Multidialectal acoustic modeling: a comparative study
In this paper, multidialectal acoustic modeling based on sharing data across dialects is addressed. A comparative study of different methods of combining data based on decision tree clustering algorithms is presented. The approaches developed differ in how they evaluate the similarity of sounds between dialects and in the decision tree structure applied. The proposed systems are tested with Spanish dialects across Spain and Latin America. All the proposed multidialectal systems improve on monodialectal performance by using data from another dialect, but the way data is shared is shown to be critical. The best combination of similarity measure and tree structure achieves an improvement of 7% over the results obtained with monodialectal systems. Peer Reviewed. Postprint (published version)
An adaptive gradient-search based algorithm for discriminative training of hmm's
Although it has proved to be a very powerful tool in acoustic modelling, discriminative training has a major drawback: the lack of a formulation that guarantees convergence regardless of the initial conditions, as the Baum-Welch algorithm does in maximum likelihood training. For this reason, a gradient descent search is usually used in this kind of problem. Unfortunately, standard gradient descent algorithms rely heavily on the choice of learning rates. This dependence is especially cumbersome because it means that, at each run of the discriminative training procedure, a search must be carried out over the parameters governing the algorithm. In this paper we describe an adaptive procedure for determining the optimal value of the step size at each iteration. While the computational and memory overhead of the algorithm is negligible, results show less dependence on the initial learning rate than standard gradient descent and, when the same idea is used to apply self-scaling, it clearly outperforms standard gradient descent. Peer Reviewed. Postprint (published version)
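To make the idea of adapting the step size at each iteration concrete, here is a minimal sketch of a simple accept/reject heuristic (sometimes called a "bold driver" rule). It is a stand-in chosen for illustration, not the procedure described in the paper, whose update rule the abstract does not specify. The function name and the `grow` and `shrink` factors are assumptions.

```python
import numpy as np

def adaptive_gradient_descent(f, grad, x0, step=0.1, grow=1.1, shrink=0.5, iters=200):
    """Gradient descent with a step size adapted at every iteration.

    Illustrative heuristic only: a step that lowers the objective is
    accepted and the step size grows; a step that does not is rejected
    and the step size shrinks.
    """
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(iters):
        candidate = x - step * grad(x)
        f_candidate = f(candidate)
        if f_candidate < fx:
            x, fx = candidate, f_candidate   # accept and be bolder next time
            step *= grow
        else:
            step *= shrink                   # reject and be more cautious
    return x, step

# Toy quadratic objective with its minimum at 3, started with a step size
# that is far too large for plain gradient descent to converge.
objective = lambda x: float(np.sum((x - 3.0) ** 2))
gradient = lambda x: 2.0 * (x - 3.0)
x_min, final_step = adaptive_gradient_descent(objective, gradient, [0.0], step=10.0)
```

The toy run starts from a step size that would make fixed-step gradient descent diverge on this objective, which is the kind of sensitivity to the initial learning rate that the abstract describes.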
First experiments on an HMM based double layer framework for automatic continuous speech recognition
The usual approach to automatic continuous speech recognition is what can be called the acoustic-phonetic modelling approach. In this approach, voice is considered to hold two different kinds of information: acoustic and phonetic. Acoustic information is represented by some kind of feature extraction from the voice signal, and phonetic information is extracted from the vocabulary of the task by means of a lexicon or some other procedure. The main assumption in this approach is that models can be constructed that capture the correlation existing between both kinds of information.
The main limitation of acoustic-phonetic modelling in speech recognition is its poor treatment of the variability present at both the phonetic level and the acoustic one. In this paper, we propose the use of a slightly modified framework where the usual acoustic-phonetic modelling is divided into two different layers: one closer to the voice signal, and the other closer to the phonetics of the sentence. By doing so we expect an improvement in modelling accuracy, as well as better management of acoustic and phonetic variability. Experiments carried out so far, using a very simplified version of the proposed framework, show a significant improvement in the recognition of a large vocabulary continuous speech task, and represent a promising starting point for future research. Peer Reviewed. Postprint (published version)
- …